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Abstract 

Background: The new sequencing technologies enable to scan very long and dense genetic sequences, obtaining 
datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences 
are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed the interest 
for the identification of haplotypes carrying the rare risk alleles. However, large scale explorations of the 
linkage-disequilibrium (LD) pattern to identify haplotype blocks are not easy to perform, because traditional 
algorithms have at least @(n 2 ) time and memory complexity. 

Results: We derived three incremental optimizations of the widely used haplotype block recognition algorithm 
proposed by Gabriel etal. in 2002. Our most efficient solution, called MIG ++ , has only 0(n) memory complexity and, 
on a genome-wide scale, it omits >80% of the calculations, which makes it an order of magnitude faster than the 
original algorithm. Differently from the existing software, the MIG ++ analyzes the LD between SNPs at any distance, 
avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset 
was obtained in 457 hours. By replacing the standard likelihood-based D' variance estimator with an approximated 
estimator, the runtime was further improved. While producing a coarser partition, the approximate method allowed 
to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, 
with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based 
haplotype blocks can span more than one million base-pairs in both HapMap II and 1 000 Genomes datasets. An 
application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how the MIG ++ can 
support genome-wide haplotype association studies. 

Conclusions: The MIG ++ enables to perform LD-based haplotype block recognition on genetic sequences of any 
length and density. In the new generation sequencing era, this can help identify haplotypes that carry rare variants of 
interest. The low computational requirements open the possibility to include the haplotype block structure into 
genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers. 



Background independent signals in genome-wide association studies 

Linkage disequilibrium (LD) is the non-random associa- (GWAS). In this regard, the r 2 was also considered in hap- 

tion of alleles at different loci and decays with increas- lotype block recognition algorithms [4,5]. Nevertheless, 

ing distance between loci [1]. High LD regions reflect the D' should remain the statistics of choice for LD model- 

the presence of chromosomal segments (haplotypes) that ing because of its more direct biological interpretation. It 

are transmitted from parents to offsprings more often reflects the history of recombination, mutation, and selec- 

than expected by chance. LD is traditionally assessed by tion events that cause some chromosomal segments to be 

the normalized D' coefficient [2] . The r 2 coefficient [3] less diverse than others and, therefore, influence the hap- 

is currently more commonly used than D' to identify lotype distribution. Moreover, it has been shown that r 2 

is not significantly more precise, accurate or efficient than 
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jy = 0, the two markers are independent (perfect equi- 
librium), while \D f \ = 1 indicates that no more than 
three of the four possible haplotypes are being observed 
in the sample (complete disequilibrium). In contrast, the 
range of r greatly depends on the allele frequencies and 
equals —1 or +1 only when the two markers have the 
same allele frequency. In such cases, |r| = 1 indicates 
that knowing the allele at one marker allows to determine 
the allele at the other marker (perfect disequilibrium). 
But when the two markers have very different allele fre- 
quencies, the interpretation of r 2 becomes difficult. This 
is especially relevant with the data generated by the new 
sequencing technologies, that allow genotyping markers 
over a very wide spectrum of allele frequencies. In such 
situations, the r 2 may fail to identify the correct relation- 
ship between nearby variants. In GWAS, this may lead to 
a wrong definition of the identified loci. 

Although, in the past, haplotype blocks have been 
mainly used to identify tag SNPs [7], a variety of other 
applications is possible with currently available data. 
Recently, analysis of exome-chip data has shown that 
within-gene LD-block distribution can be informative 
of the gene function and of the possible relationship 
between genes and specific groups of phenotypes [8]. 
Another application is the genome-wide haplotype associ- 
ation scan, which was successful in uncovering risk loci for 
coronary artery disease [9], Alzheimer's disease [10] and 
breast cancer [11]. So far, genome-wide haplotype associa- 
tion scans have been mostly performed based on fixed- or 
variable-width sliding window methods, which systemati- 
cally miss haplotypes that are longer than some specified 
maximal window widths. An efficient genome-wide hap- 
lotype block recognition could help overcome such lim- 
itations, thus enhancing the biological interpretation of 
the results. In the study of rare variants, where collapsing 
methods (mostly based on gene boundaries) are becom- 
ing increasingly popular [12], the availability of haplotype 
blocks at genome-wide level would allow to collapse vari- 
ants based on block boundaries, capturing inter-genic 
variants, and avoiding the problem to define the gene 
boundaries. Additional applications include downstream 
analyses of GWAS, such as pathway-based approaches, 
where statistics for multiple SNPs are summarized into 
gene-specific P-values, which are then employed for 
gene ranking [13]. In pathway-based analyses, SNP- 
to-gene mapping is typically based on SNP proximity to 
the gene boundaries. With this method, when a region 
is gene-dense, it may be problematic to assign SNPs to 
a single, specific gene. An LD-based assignment would 
overcome this limitation and increase the power of down- 
stream analyses [14]. In general, ignoring the LD struc- 
ture in downstream analyses of GWAS can result in the 
misinterpretation of the findings [15]. Popular genome 
browsers, such as the Ensembl [16] or UCSC [17], are 



suitable for visualizing the LD distribution over regions 
of interest. However, they only allow pairwise LD calcu- 
lation between markers at <500 kb distance from each 
other and do not provide any LD-block partition. With no 
predefined block partition, the visual assessment of such 
LD patterns might be influenced by investigators subjec- 
tivity. On the other hand, the 500 kb distance constraint 
may limit the investigation of larger strong-LD regions. 
With the availability of pre-calculated, threshold-free LD 
blocks, we would overcome both these limitations. 

There is extensive literature on haplotype block infer- 
ence [18-21], including methods based on probabilistic 
graphical models [22]. The latter allow an accurate iden- 
tification of SNP clusters, even in situations when SNPs 
are not necessarily contiguous. However, due to its sim- 
plicity, the most commonly used LD-based algorithm 
remains the one proposed by Gabriel et al. [23], which 
is implemented in Haploview [24]. The Haploview algo- 
rithm is widely used in genetic association studies and 
it is included in popular software, such as PLINK [25]. 
However, with a G(n 2 ) time and memory complexity, 
where n is the number of SNPs, the algorithm is applica- 
ble only to short genomic segments containing no more 
than a few thousand SNPs. Unless runtime and memory 
usage are artificially reduced by splitting large segments 
into smaller chunks, the algorithm cannot be applied to 
densely genotyped segments or genome-wide analyses. 

In this paper, we describe how we improved efficiency 
and scalability of the Haploview algorithm (1) by adopt- 
ing an incremental computation of the haplotype blocks 
based on iterative chromosome scans and (2) by estimat- 
ing D' confidence intervals (CIs) using the approximate 
variance estimator proposed by Zapata et al. [26]. The 
incremental computation strategy led to an algorithm, 
termed MIG ++ , that has 0(h) memory complexity and 
omits more than 80% of the pairwise LD computations, 
while obtaining exactly the same final haplotype block 
partition as Haploview. In contrast to Haploview, the 
new algorithm can consider pairwise LD between SNPs 
at any distance. With MIG ++ , we performed the hap- 
lotype block recognition of the entire HapMap phase II 
dataset of CEPH haplotypes. By introducing the approxi- 
mate variance estimator, the performance of the MIG ++ 
was further improved and allowed us to perform the block 
partition of the entire 1000 Genomes Project dataset of 
CEPH haplotypes. To show a practical application of the 
obtained genome-wide block partition, we finally com- 
pared SNP-based and haplotype block-based association 
tests in a GWAS context. 

Methods 

Haplotype block definition 

The haplotype block recognition algorithm proposed by 
Gabriel et al. [23] is based on \D'\ and its 90% CI, with 
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CL and CU being the lower and upper bounds of the 
CI, respectively. SNP pairs are classified as follows: (1) 
in strong LD if CL > 0.7 and CU > 0.98; (2) showing 
strong evidence of historical recombination [strong EHR) 
if CU < 0.9; (3) non-informative, otherwise. Informative 
pairs are those satisfying conditions (1) or (2). A haplotype 
block was then defined as follows: 

Definition 1. (Haplotype Block). Let C = {gi, . . . ,g n ) be 
a chromosome ofn SNPs, G = {gi, . . . ,gj) a region of adja- 
cent SNPs in C, I the number of strong LD SNP pairs in G, 
and r the number of strong EHR SNP pairs in G. Then, G is 
a haplotype block if 

(a) the two outermost SNPs, gi andgj, are in strong LD, 
and 

(b) there is at least a proportion d of informative pairs 
that are in strong LD, i.e.: 1/ (/ + r) > d. 

In their original work, Gabriel et al. [23] set d = 0.95 
after investigating the fractions of strong LD SNP pairs 
in genomic regions of different length and in different 
populations. 

The Haploview algorithm [24] performs a haplotype 
block partition in two steps: (1) all regions satisfying 
Definition 1 (a) are collected in a set of candidate hap- 
lotype blocks; (2) from this set of candidates, a subset 
of non-overlapping regions that satisfy Definition 1 (b) 
is selected. In the first step, the entire chromosome is 
scanned and, for every SNP pair, the \D f \ CI is computed 
and stored in an n x n matrix. The matrix is then tra- 
versed to identify the pairs that satisfy Definition 1 (a). 
These pairs mark regions of different length that are can- 
didates to become haplotype blocks. In the second step, 
the candidate regions are sorted by decreasing length and 
processed starting with the largest one. If a region satis- 
fies Definition 1 (b), it is classified as a haplotype block, 
and all other overlapping candidate regions are discarded. 
Regions not satisfying Definition 1 (b) are skipped. This 
process continues with the next largest candidate region, 
until the candidate set is completely processed and the list 
of haplotype blocks is complete. 

The overall complexity of the algorithm is mainly deter- 
mined by the first step. More specifically, the G(n 2 ) time 
and memory complexity is due to the computation and 
maintenance of the n x n CI matrix. For this reason, we 
concentrated our improvements on the first step. 

Incremental computation of haplotype blocks 

The core ideas of our optimizations are to compute haplo- 
type blocks incrementally and to omit, as soon as possible, 
regions that cannot be extended to larger blocks due to 
an insufficient proportion of strong LD SNP pairs. In this 



way, we avoid both unnecessary computations and the 
storage of an n x n CI matrix. The incremental haplotype 
block computation is based on the concepts of a SNP-pair 
weight and a region weight described below. 

Definition 2. (SNP-pair weight). Let C and d be as 

defined in Definition 1. For a given pair of SNPs gi and gp 
the SNP-pair weight, w(i,j), is defined as follows: 



w(i,j) = 



1 — d if gi and gj are in strong LD, 
—d if gi and gj show strong EHR, 
0 otherwise. 



Definition 3. (Region weight). Let G be as defined in 
Definition 1. The region weight of G, w(i,j), is defined as 
the sum of all SNP-pair weights in G: 



1 v 



v=i+l u=i 

The following theorem defines a haplotype block based 
on the region weight. 

Theorem 1. Let G be as defined in Definition 1. G is a 
haplotype block ifw (i,j) = 1 — d and w{i,j) > 0. 

Proof. From Definition 2, if SNPs gi and gj are in 
strong LD, then w (i,j) = 1 — d. Therefore, Definition 1 (a) 

is satisfied. G contains S = Ylv=i+i Y7u=i 1 possi- 
ble SNP pairs, of which / are in strong LD, r show 
strong EHR, and the remaining ones are non-informative. 
From Definitions 2 and 3, it follows that w(i,j) = 

EL+i Y. V u =i ™ (u, v) = 1(1 ~d) + r(-d) + (S - 1 - r)-0 = 
l-d(l+r)Afw(i,j) > 0,then/-rf(/+r) > 0 //(/+r) > 
d. Therefore, Definition 1 (b) is also satisfied. □ 

Theorem 1 is the basis for the incremental haplotype 
block reconstruction, which is the core of our optimiza- 
tions. In the following, we present three gradual improve- 
ments of the Haploview algorithm: a memory-efficient 
implementation based on the Gabriel et al. [23] defini- 
tion (MIG); MIG with additional search space pruning 
(MIG + ); and MIG + with iterative chromosomal process- 
ing (MIG ++ ). Theorem 1 ensures that all three algorithms 
produce block partitions that are identical to the original 
Haploview results. 

The MIG algorithm 

For a given chromosomal segment C containing n SNPs, 
the maintenance of an n x n matrix containing all the \D f \ 
CIs can be avoided by storing n region weights in a uni- 
dimensional vector W nx i. In each element of W, W[i], we 
store the weight of a chromosomal region that starts at 
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SNP gi> When the region is enlarged by including addi- 
tional SNPs to the right of gu the weight W[i] is updated 
accordingly. This procedure, illustrated in Figure 1, begins 
with setting all the weights to 0. At the initial stage, 
the vector W represents all one-SNP regions. Then, the 
region starting at SNP g\ is enlarged by including the next 
SNP, gi* Therefore, starting from gi, chromosome C is 
processed one SNP after the other, from left to right. 
For a SNP gj, with j > 2, all SNP pair weights 
i = j — 1, . . . , 1, are computed and added up as s = 
w(j- 1,;) H \-w (/,/). 

5 and W[i] are updated for every computed weight 
w(i,j). Before the update, 5 = w(j — l,y) + • — h w(i — l,y) 
and W[i] contains the region weight w(i,j — 1), which was 
already computed for the previous SNP gj-\. Then, s is 
incremented by w(i,j) and W[i] is incremented by the new 
value of 5. W[i] now represents the region weight w(i,j), 
i.e., w(i,j) = w(i,j — 1) + w(j — l,y) H h w(i,j). When- 
ever w(i,j) = 1 — d and w{i,j) > 0, Theorem 1 is satisfied 
and the region {gi, . . . ,gj) is added to the set of candidate 
haplotype blocks. This procedure is repeated with the next 
SNP, gj+\. An example of the first three computational 
steps is given in Figure 2. The pseudocode is provided in 
Algorithm A.l (Additional file 1). 

MIG reduces the memory complexity from B(n 2 ) to 
G(n). Moreover, instead of identifying candidate regions 
that satisfy only Definition 1 (a) (as in Haploview), MIG 
checks immediately both conditions (a) and (b). This 
yields a smaller set of candidate blocks, and therefore indi- 
rectly speeds up also the second step of the Haploview 
algorithm. 

The MIG + algorithm 

While MIG drastically reduces the memory requirements 
by avoiding the maintenance of the CI matrix, it still 
computes weights for all SNP pairs, totaling n(n — l)/2 



computations as in Haploview. To omit unnecessary com- 
putations, we apply a search space pruning to the MIG 
algorithm to identify regions that cannot be further 
extended to form a haplotype block. The pseudocode is 
shown in Algorithm A. 2 (Additional file 1). 

Instead of computing weights for all pairs of SNPs, 
only weights w(j — 1,/), . . . , w{b,j) are computed, where 
b = min({i \ 1 < i < j A w max (i,j) > 0}) and 
WmaxiUj) = max{w(i,k) \ j < k < n}. The function 
Wmax(Uj) is an upper bound for the weight of all regions 
(gi> • • ->gj>- - ->gk) that start at# and end after gj, i.e., those 
extending beyond the region (g if . . . ,gj). If w max (i,j) < 0 
for some /, none of the regions (gi,...,gk) can satisfy 
Definition 1 (b). The smallest z, that can be a poten- 
tial starting point of a region with a positive weight, can 
therefore be set as breakpoint b. Regions starting left of 
b and stopping right of / receive negative weights and are 
discarded (Figure 3, left panel). 

The upper bound, w ma x(i>j)> is estimated assuming that 
all unprocessed SNPs to the right of gj are in strong LD 
with each other and with all SNPs in the region (g it ... t gj). 
Then, w(i, k) < w(i,j) + (1 — d) • 5, where w(i,j) is already 
computed and S = ((j—i+l) + (k—i))(k—j)/2 is the num- 
ber of unprocessed SNP pairs. Since S is largest for the 
longest region (g it . . . 1 g n ) 1 we have max{w(U k) \ k > ;} < 
w{Uj) + (1 - d) • ((/ - / + 1) + (n - i))(n - j)/2, and the 
estimated upper bound w max {i,j) is defined as follows: 

Wmaxihj) = w(i>j)+(l-d)-((j-i+l)+(n-i))(n-j)/2. 

The MIG + algorithm performs at most kn(n — l)/2 
computations, where X, 1 — d < X < 1, depends on 
the data. The worst case of k = 1 occurs only in the 
unlikely situation when a few very large blocks span an 
entire chromosome. 




s = w(j - 1, j) H \-w(i,j) 

1 



C = (#i 9i9i-\ 9j 



9* ) 



W 



3-1 
3 



0 



w(ij) \ 

Regions under j Regions to j 
processing be processed 

Figure 1 Processing a chromosome with the MIG algorithm. A chromosome is processed from left to right, and gj is the current SNP. Values in 
1/1/ [/'] correspond to the region weights w(i,j) of {g- h ...,gj).s accumulates SNP-pair weights w(j - 1 J), w(i,j). 
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Figure 2 The first three computational steps of the MIG algorithm. The vector 1/1/ and the variable s are initialized to 0. The processing starts at 
gj with the analysis of the region {g^.gj)- The SNP-pair weight w(1,2) is computed and added to s. Then, the region weight w(1,2) is incrementally 
computed as 1/1/[1] +s and stored in 1/1/[1] (by replacing the old value). Next, SNP g^ is processed. After initializing s = 0, weight w(2,3) is computed 
and stored in s. w(2, 3) is incrementally determined as W[2] +s, and stored in W[2]. Then, weight w(1 , 3) is computed, and 
w(1,3) = w(],2) + w(2,3) + w(],3) = 1/1/[1] +s.The next SNP to the right, g^, is processed in a similar way. 



The /W/G ++ algorithm 

A limitation of the MIG + algorithm is its blindness about 
the unprocessed area to the right of the current SNP 
gj. Assuming strong LD for all SNP pairs in this area 
results in a conservative upper bound, w max (hj)> f° r the 
region weights. An additional optimization step allows to 
obtain a more precise estimate of w max (i,j) and further 
prunes unnecessary computations. The pseudocode of the 
modified algorithm, MIG ++ , is given in Algorithm A. 3 
(Additional file 1). 



The improved algorithm is an iterative procedure that, 
at each iteration, scans the chromosome from left to right 
and computes the weights only for a limited number of 
SNP pairs. For a SNP gj, the SNP pairs considered in 
an iteration are restricted to a window of size win: only 
the weights w(j—l,j), . . . , w(t,j) are computed, where t = 
max({&,y— win}) and 1 < win < n (Figure 3, right panel). 
At each new iteration, the window size is increased by 
a number of SNPs equal to win. Therefore, the number 
of computed SNP-pair weights increases proportionally. 



MIG+ 




w(i,j) 



w(i, k) 



MIG++ 



win 




*I * 



w(i, k) 



Wmax(hj) = max{w(i,k) | k > j} w m ax(i,j) = max{w(i,k) | k > j} 

Figure 3 Processing a chromosome with MIG+ and MIG ++ . w max (i,f) is computed taking into account the SNP-pair weights computed at the 
previous stages (colored in gray). MIG + : since w max (i,j) < 0 for any region that starts before the SNP g^ and ends after the SNP gj, then the SNP-pair 
weights within the hatched area are omitted from computation. MIG ++ : the SNP-pair weights within the hatched area are omitted from 
computation if w max (i,j) < 0 for any region that starts before the SNP g t and ends after the SNP gj, or g t is out of the window win (in the latter case 
the omitted SNP-pair weights may be used in the next iteration). 
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This allows a more precise estimation of the upper bounds 
for the region weights with every new iteration. 

By considering all SNP-pair weights computed in all 
previous iterations for the estimation of the upper bound, 
WmaxiUj), the algorithm requires linear time for each 
individual SNP pair to sum up all weights inside the cor- 
responding region. We use a computationally cheaper 
constant-time solution, though it may lead to a less accu- 
rate estimation. Since w(i,k) < w(i,j) + w(l,k) — iv(l,;), 
we have max{w(/, k) \ k > j] < w(i,j) + max{w(l, k) \ 
k > j] — w(l,j). An upper bound w max (i,j) can then be 
computed as follows: 

Wmaxihj) = w(Uj) + max{w(l, A:) | k > ;} - w(l,j). 

max{u>(l, k) | k > j} is computed in linear time after every 
scan of the chromosome, whereas w(l,j) is computed 
in constant time. Thus, the computation of the upper 
bound w max (i,j) for each individual SNP pair requires 
only constant time. 

When win = n, MIG++ is identical to MIG+. When 
win = 1, the number of iterations becomes too large, in- 
troducing a significant computational burden. We pro- 
pose to set win = \(n — 1)(1 — d)/2\, that corresponds to 
1 — d percent of all SNP pairs, which is the minimal frac- 
tion of SNP pairs that must be considered before one can 
be sure that an n-SNP segment is not a haplotype block. 

The MIG ++ performs at most Xn(n — l)/2 computa- 
tions, where X, 1 — d < X < 1, depends on the data. 
However, the value of X obtained with the MIG ++ algo- 
rithm is expected to be always smaller than that from the 
MIG + algorithm, because of the more precise estimation 

Of Wmax(i>j)- 

Alternative methods to estimate the D' CI 

A critical step of the Gabriel et al. [23] approach is 
the estimation of the D' CI. In a genome-wide context, 
this calculation can be repeated hundreds of millions of 
times. In Haploview, the CIs are obtained by means of 
the likelihood-based procedure proposed by Wall and 
Pritchard [27], which requires from 100 to 1,000 itera- 
tions. This method can be replaced with a computation- 
ally cheaper solution, based on an approximated estimator 
of the D f variance, as proposed by Zapata et al. [26]. 
This solution would make the whole block recognition 
algorithm significantly faster. 

The wall and pritchard (WP) method 

The true allele frequencies of each SNP are assumed to 
be equal to the observed allele frequencies. The like- 
lihood of the data in the four-fold table obtained by 
crossing any SNP pair, conditional to the \D f \ value, can 
be expressed as / = P(data\\D f \). I is evaluated at each 
value of \D'\ = 0.001 x p, with p = 0, 1, ... , 1000. CL 
is defined as the largest value of \D f \ such that 



Y^=o ^(01 X^=o° — a > wnere ol is the significance 
level. Similarly, CU is defined as the smallest value of \D' \ 

such that e^°+i mi E'=°o° m < «. 

The approximate variance (AV) method 

Consider two SNPs, u and v, with alleles {u\,U2\ and 
{vi, V2}, respectively. Let n UiVj and f u . v . denote, respectively, 
the absolute and relative frequencies of the four possi- 
ble haplotypes, UiVj(i,j e {1,2}), with/ M . and/ V; being 
the marginal frequencies of the two SNPs. In total, N = 
J2 n uivj haplotypes are observed. Zapata et al. [26] showed 
that the variance of D' can be approximated as follows: 

V0) « ((1 - 1^1) x (N • V(D) - \D'\D max 
x if u Ji+fuj2-2\D\)) 
+ |D' 1/3(1 -f 3 ))/(N.D 2 max ), 

where D' = D/D max ;D = f Uin - f Ul f n ; D max is 
min{/ Ml (l -f n ),(l - fuOM when D > 0 or min 
UuJn* (1 -fui)(± -A)} whenD < 0;/i isf n whenD' > 0 
or f V2 when D' < 0; fz is f V2 when D' > 0 or f n when 
D' < 0;f 3 isf Uin ,f UlV2 ,f U2Vl , zndf U2V2 whenD max is f u J n , 
fujwfujvv and f U2 f V2 , respectively; and 

V(D) » {f u jujvjv 2 +D(f U2 -f Ul )(fv 2 -f Vl )-D 2 ) /N. 

When jy = ±1, then V(D') = 0. The 1 - a CI of D' 
is equal to D' ± Z a /2\/V(P f ), where Z a /2 is the 1 — a/2 
percentile of the standard normal distribution. 

Experimental evaluation 

The experimental evaluation was based on the phased 
CEPH genotypes included in the HapMap phase II 
(HapMapII) [28] and the 1000 Genomes Project phase 1 
release 3 (1000G) [29] databases. The HapMapII dataset 
included 2,543,857 SNPs from 120 haplotypes (60 indi- 
viduals) and the 1000G dataset included 10,858,788 SNPs 
from 170 haplotypes (85 individuals). 

To compare the new algorithms to the standard 
Haploview, in terms of runtime and memory usage, the 
ideal solution would have been that of randomly sampling 
regions with different characteristics from the HapMapII 
or 1000G datasets. However, the Haploview algorithm was 
so computationally expensive that it prohibited to con- 
sider a sufficiently large number of random regions and, 
therefore, to obtain a representative sample of all pos- 
sible scenarios over the whole genome. For this reason, 
we selected the regions such that the most extreme sce- 
narios, in terms of median SNP minor allele frequency 
(MAF) and median inter-SNP distance, were covered. To 
identify such representative regions, we performed the 
systematic scan of all SNPs in the genome using a slid- 
ing window of 1,000 SNPs, after removing chromosomal 
centromeres and the HLA region. For each sliding region, 
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the median MAF and inter-SNP distance were recorded 
(Additional file 1, Figure A.l). All regions were then rep- 
resented in a two-dimensional Euclidean space, where the 
normalized inter-SNP distance was plotted against the 
normalized median MAF (Additional file 1, Figure A. 2). 
A total of nine regions were chosen for the experiments: 
the eight regions located on the outermost boundaries of 
the Euclidean space and the region closest to the center of 
the space. These regions represent scenarios with extreme 
and moderate median MAF and median inter-SNP dis- 
tance. The procedure was repeated using larger sliding 
windows of 5,000 to 30,000 SNPs. If not stated otherwise, 
in the experimental results we report median values over 
the nine regions for every different window size. 

The block partitions obtained with the WP and AV 
methods for D' CI estimation were compared in terms 
of total number of blocks, median number of SNPs per 
block, proportion of SNPs clustered into blocks, and 
median within block haplotype diversity. Haplotype diver- 
sity [19,20] is defined as the ratio between the number of 
common haplotypes and the total number of haplotypes 
within a block. Common haplotypes are those occurring 
more than once. The haplotype diversity index ranges 
from 0 (complete diversity) to 1 (no diversity). 

The three MIG algorithms were implemented in C++. 
To guarantee a fair comparison, the original Java imple- 
mentation of the Haploview algorithm was rewritten in 
C++, too. By default, Haploview considers only SNP pairs 
within a maximal distance of 500Kbp. We removed this 
constraint because it could affect the block partitions of 
very wide regions. For the WP method, we set the number 
of likelihood estimation points to 1,000 (100 in the orig- 
inal Haploview implementation). We didn't consider the 
population specific two-, three-, and four-marker rules, 
proposed by Gabriel et al [23] when very short regions 
are processed, because they have no impact on the com- 
putational efficiency of the algorithms. All experiments 
were run on a machine with an Opteron 8356 Quad Core 
(2.3GHz) CPU. 

Genome-wide association study of rheumatoid arthritis 

We applied our haplotype block partitioning algorithm 
to the genome-wide association study of the North 
American Rheumatoid Arthritis Consortium (NARAC) 
dataset. Data consisted of 868 cases and 1,194 controls. 
The samples were genotyped at 544,917 autosomal and 
sex chromosome SNPs. Quality check was performed 
with PLINK 1.07 [25]: we excluded 5,422 SNPs with a 
call rate of <90%, 11,327 SNPs with a minor allele fre- 
quency of <0.001, and 898 SNPs because of significant 
deviation from Hardy- Weinberg equilibrium in controls 
(p-value < 10 -6 ). No samples were excluded because of 
low call rate (<90%); 2 cases and 5 controls were removed 
because of sex mismatch; 1 case and 8 controls were 



additionally excluded after population stratification test 
based on principal component analysis performed with 
EIGENSOFT 5.0.1 [30]. After the quality control, 514,539 
autosomal SNPs and 2,046 samples were available for 
analyses. 

Haplotypes were phased using SHAPEIT version 2 [31]. 
To achieve good accuracy, we set 400 conditioning states 
per SNP. Recombination rates were taken from HapMap 
phase II build 36 and effective population size was set 
to 11,418 (as suggested for CEU populations). The esti- 
mated haplotypes were submitted to MIG ++ and pro- 
cessed with the WP and AV methods. We obtained 98,979 
WP blocks, covering 445,832 SNPs, with 68,707 single- 
ton SNPs outside of any block. The AV method iden- 
tified 97,816 blocks, covering 446,170 SNPs, and 68,369 
singleton SNPs. 

The genome-wide association scan was based on a logis- 
tic regression model adjusted for sex and the top 10 
eigenvectors obtained from EIGENSOFT 5.0.1 [30]. The 
association between disease status and individual SNPs or 
haplotype blocks was tested with a likelihood ratio test 
using PLINK 1.0.7 [25] with the logistic-genotypic and 
omnibus options, respectively. Within each block, hap- 
lotypes with frequency of <0.01 were collapsed together 
to preserve power. Singleton SNPs outside blocks were 
treated as in the SNP-based analysis, therefore produc- 
ing analogous results. Genomic control (GC) correction 
was applied to both SNP- and block-based GWAS results. 
Bonferroni-corrected significance thresholds were set to 
2.98 x 10 -7 for analysis based on the WP block parti- 
tion (i.e. 0.05 divided by the sum of 98,979 WP blocks and 
68,707 singleton SNPs), 3.01 x 10" 7 for the analysis based 
on the AV method (i.e. 0.05 divided by 97,816 AV blocks 
plus 68,369 singleton SNPs), and 9.17 x 10" 8 (i.e. 0.05 / 
514,539) for the individual SNP analysis. 

Results 

Runtime and memory with the WP method 

Figure 4 shows runtime and memory performance of 
Haploview and the three MIG algorithms based on the 
WP method, when applied to the 1000G dataset. Since 
both Haploview and MIG perform n(n—l)/2 computa- 
tions, it was expected to see identical runtime: both of 
them took 80 hours to process regions of 30,000 SNPs. 
However, MIG used three orders of magnitude less mem- 
ory than Haploview (3 MB vs. 7 GB). The runtime was 
significantly reduced with MIG + (27 hours) and even 
further with MIG ++ (14 hours). The runtime difference 
between algorithms increased with the region size (num- 
ber of SNPs). Memory usage was identical for MIG and 
MIG + , whereas MIG ++ required slightly more memory 
to store the computational status between iterative region 
scans. Similar results were obtained on the HapMapII 
dataset (results not shown). 
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Figure 4 Performance of the algorithms with the WP method, when applied to the 1000G dataset. MIG + and MIG ++ show substantially 
better runtime than MIG and Haploview. All versions of the MIG algorithm significantly outperforms Haploview in memory usage and, therefore, are 
able to handle densely genotyped and genome-wide data. 



The MIG ++ omitted more unnecessary computations 
than MIG + , which is reflected by the smaller k coeffi- 
cient in both HapMapII and 1000G datasets (Figure 5). 
The k values decreased with increasing number of SNPs 
in the region. When increasing the region size, after a 
rapid decline for small regions, k reached stable values 
with both MIG + and MIG ++ algorithms and in both 
datasets. This behavior relates to the LD decay with dis- 
tance. In regions of 30,000 SNPs in the 1000G dataset, 
the MIG ++ algorithm was able to omit ~80% of the cal- 
culations (A~0.20), while MIG+ could omit ~60% of the 
calculations (A~0.40). An example of the reduction of 
the number of calculations is given in Figure 6, where 
MIG + and MIG ++ are compared to Haploview, which is 
represented by the entire triangle. 

Runtime and memory with the AV method 

When we introduced the AV method to estimate the D' 
CI, we observed a drastic reduction of the computational 
time of the MIG algorithm. With the AV approach, the 
median runtime needed to analyze sequences of 10,000 
SNPs in the 1000G dataset was of 2 minutes. The same 
analysis took a median of 8.7 hours with the WP method 
(Figure 7, left panel). Proportional time reduction was 
observed for MIG + and MIG ++ . Similar results were 
obtained in the HapMapII dataset (results not shown). 



We observed that the introduction of the AV method 
caused a slight increase of the k coefficient (Figure 8). This 
is because, with the AV method, more SNP pairs are classi- 
fied to be in strong LD. This causes an increase of the num- 
ber of possible configurations to be checked and results 
in a larger set of candidate haplotype blocks. With the AV 
method, the MIG algorithms identified tens of millions of 
candidate haplotype blocks (Additional file 1, Figure A.3). 
The number of candidate blocks was even larger when the 
AV method was applied directly to Haploview, where can- 
didate blocks need to satisfy only Definition 1 (a). This 
significantly larger number of candidate blocks explains 
the increase in runtime of Haploview when using the AV 
method: for regions of 5,000 SNPs in the 1000G dataset, 
the median runtime was of 451 hours with the AV against 
the 2 hours with the WP method (Figure 7, right panel). 

Block Partitions with the WP and AV Methods 

The characteristics of the different block partitions 
obtained with the WP and AV methods are summarized 
in Figure 9. The AV method produced a smaller num- 
ber of blocks than the WP method (top panels). The 
median number of blocks per region increased along with 
the number of SNPs, and it increased faster for the WP 
compared to the AV method. Considering the median 
number of SNPs per block, the AV method produced 
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Figure 5 Values of the X coefficient for MIG + and MIG + with the WP method. MIG + omits more SNP-pair weight computations than MIG + , 
which is reflected by the smaller A. coefficient. 
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Figure 6 LD heatmap of region chr20:31 ,767,872-33,700,401, which contains 1,000 polymorphic SNPs in the HapMapll dataset. The area 
under the dashed line corresponds to all possible SNP pairs. The computed SNP-pair weights are indicated in red, with higher red intensity 
corresponding to higher |D'| values. The non-colored area corresponds to the SNP pairs omitted from computations. The solid line indicates the 
final haplotype block partition. It can be observed that, while the final block partition is identical for the two algorithms, more computations have 
been omitted by MIG+ than MIG+. 



larger blocks than the WP method (middle panels). For 
very short regions (e.g., 1,000 SNPs) both methods gen- 
erally induced larger blocks. This is because such small 
regions might be completely covered by a single or very 
few haplotype blocks. The median number of SNPs per 
block decreased along with the increase of the length of 
the region considered. Overall, the AV method assigned a 
higher percentage of SNPs to blocks compared to the WP 
method, which left more singleton SNPs outside of any 
block (bottom panels). In the analysis of the HapMapll 
dataset, 98.4% of the SNPs were clustered within blocks 
with the AV method and 90.5% with the WP method. In 
the analysis of the 1000G dataset, the percentages were of 
99.7 and 86.8, respectively. 

We observed that 100% of the blocks identified by the 
WP method were overlapping blocks identified by the 
AV method. More specifically, 80% to 90% of the blocks 
based on the WP method were completely included within 
blocks based on the AV method (Figure 10). The remain- 
ing 10% to 20% of WP blocks whose borders were cross- 
ing borders of AV blocks, could be entirely attributed 
to the selection mechanism in the step (2) of the algo- 
rithm, when larger candidate blocks are prioritized over 



the shorter ones. In fact, when, instead of looking at the 
final block partition, we focused on the intermediate set 
of candidate blocks before the final pruning, we observed 
that 100% of the candidates from the WP method were 
entirely included within the candidate blocks from the AV 
method. 

Consistently with the findings of larger AV blocks, we 
observed a generally higher haplotype diversity in the 
partitions obtained with the AV method compared to 
the WP method (Figure 11). For instance, when consid- 
ering regions of 30,000 SNPs in the 1000G dataset, we 
observed median within-block haplotype diversity indices 
of 0.876 and 0.982 with the AV and WP methods, respec- 
tively. Slightly higher diversity indices were observed in 
the HapMapll dataset: 0.975 and 0.992 for the AV and 
WP methods, respectively. The within-block diversity was 
more variable in short than in long regions because, as 
observed above, when regions are too small, then it might 
be difficult to identify more than one block. 

Whole genome haplotype block recognition 

The linear memory complexity and the significant reduc- 
tion of the number of computations allowed us to run 
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Figure 7 Impact of the WP and AV methods on runtime, when applied to the 1 000G dataset. The AV method had negative impact on 
Haploview runtime: the processing was significantly slower compared to the WP method. However, when combined with MIG, the AV method 
reduced the computational time from hours to seconds, compared to the WP method. 
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Figure 8 Values of the X coefficient for MIG + : comparison between WP and AV methods. When using the AV method, more SNP-pair weights 
were computed and, therefore, the X coefficient was greater compared to that from the WP method. However, for its higher the runtime efficiency, 
the AV method still allowed to significantly decrease the computational time compared to the WP method. 



MIG ++ on a genome-wide scale. We could run MIG ++ 
on the full HapMapll dataset using both the WP and 
AV methods for D' CI derivation. Using the more effi- 
cient AV method, we were also able to run a genome-wide 
haplotype block partition of the complete 1000G dataset. 
The runtime for the two datasets is shown in Figure 12. 
For HapMapll, the maximal runtime was of 1 hour when 



using the AV method and of 457 hours when using the 
WP method. In both cases, the maximal runtime was 
observed for chromosome 2, which contained 220,833 
SNPs. The median k value across all chromosomes was 
0.129 (min = 0.125, max = 0.133) for the AV method and 
0.103 (min = 0.099, max = 0.110) for the WP method. 
For the 1000G dataset, the maximal runtime using the 
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Figure 9 Haplotype block characteristics of WP and AV methods. Top panels show the number of haplotype blocks per region, middle panels 
show the median number of SNPs per block, and bottom panels show the overall percentage of SNPs in haplotype blocks. 
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Figure 1 0 Number of blocks detected with the WP method that are completely inside blocks detected with the AV method. Independently 
of the region size, from 80% to 90% of haplotype blocks detected with the WP method were completely inside haplotype blocks detected with the 
AV method. 



AV method was of 44 hours on chromosome 2, which 
contained 913,923 SNPs. The median X value across all 
chromosomes was 0.216 (min = 0.206, max = 0.224). The 
maximal memory usage was very low and didn't exceed 
151 MB and 3.6 GB for the HapMapll and 1000G datasets, 
respectively. 

Figure 13 shows the number of haplotype blocks per 
chromosome. In the HapMapll dataset, the largest num- 
ber of blocks occurred in chromosome 2: 14,164 blocks 
with the WP method and 7,482 blocks with the AV 
method. The number of blocks detected with the WP 
method was always exceeding the number of blocks 
detected with the AV method. For some chromosomes, 
the partitions obtained by the WP method contained 
almost twice as many blocks as the partitions obtained by 
the AV method. When using the AV method, we detected 
a very similar number of blocks in the HapMapll and 
1000G datasets. Across all chromosomes, 100% of the 
blocks detected with the WP method were overlapping 
blocks detected with the AV method. The median per- 
centage of the WP blocks completely covered by the AV 
blocks was 0.797 (min = 0.773, max = 0.813). 



The characteristics of the whole-genome block parti- 
tions obtained with the AV and WP methods are sum- 
marized in Table 1. The results were similar to the 
experiments on smaller regions. In the HapMapll dataset, 
fewer and larger blocks were detected with the AV method 
than with the WP method. With the AV method, a 
higher percentage of SNPs was assigned to blocks and 
the within-block haplotype diversity index was slightly 
smaller. However, the haplotype diversity was close to 
one for both methods, indicating that in both cases 
the number of possible haplotypes should be very lim- 
ited. When applying the AV-based MIG ++ algorithm to 
the 1000G dataset, we observed a higher percentage of 
SNPs in blocks and a slightly smaller diversity index, 
which is explained by the higher number of SNPs per 
block. 

For both HapMapll and 1000G datasets, the largest 
blocks were located over the chromosomal centro- 
meres and spanned tens of millions of base-pairs (bp) 
(Additional file 1, Figure A.4). Some of these very large 
blocks were characterized by very low and irregular SNP 
density. After filtering out these exceptionally large blocks, 
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Figure 1 1 Within-block haplotype diversity with WP and AV methods. AV method-based haplotype blocks showed higher haplotype diversity 
compared to WP method-based haplotype blocks. For regions with more than 1 0,000 SNPs, the median haplotype diversity within blocks was low 
(>0.8). 
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the largest block identified by the WP-based MIG ++ 
algorithm in the HapMapII dataset was located in chro- 
mosome 1, it was 1,017,844 bp long and included 398 
SNPs. When using the AV method, the largest block was 
located in chromosome 12, it was 1,190,412 bp long and 
included 335 SNPs. In the 1000G dataset, the largest block 
detected by the AV-based MIG ++ was located in chro- 
mosome 1, it was 1,361,781 bp long and included 2,896 
SNPs. 

Genome-wide association study of rheumatoid arthritis 

After the GWAS, we observed a genomic inflation factor 
k of 1.015 for the SNP-based analysis, 1.082 for the AV 
block-based analysis, and 1.077 for the WP block-based 
analysis. After GC correction, in the SNP-based analy- 
sis, 116 SNPs were genome-wide significant. Of them, 
106 were located inside 25 AV blocks and 110 inside 27 
WP blocks. From the AV and WP block-based analyses, 
we observed 29 and 33 genome-wide significant blocks, 
respectively. Twenty-three of such blocks were the same 
between the two methods. The results from the SNP- 
and block-based analyses are compared in Table 2. The 



first part of the table shows the 20 genome-wide signif- 
icant loci detected by both SNP- and block-based anal- 
yses. In most cases, the AV and WP methods brought 
to identical results. One exception was the 4 th locus, 
where two adjacent AV blocks including 6 and 14 SNPs, 
respectively, corresponded to two adjacent WP blocks of 
7 and 13 SNPs, respectively. That is, one SNP shifted 
from one block to another. In terms of significance, 
results were practically unchanged. A second exception 
was locus 13, were a block was detected only with the 
WP but not with the AV method. The last two excep- 
tions were loci number 15 and 19. In both cases, an AV 
block was split into two WP blocks. The second part 
of Table 2 shows a number of loci that wouldn't have 
been detected with a SNP-based GWAS, but were uncov- 
ered by at least one of the two block partition methods. 
The AV and WP methods produced similar results. We 
didn't observe any clear advantage of one method com- 
pared to the other. The last section of the table shows 
that there was a small number of loci uncovered only 
by the SNP-based analysis. For these loci, the p-values 
from the block-based analyses were often close to the 
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Figure 1 3 Number of haplotype blocks in the HapMapII and 1 000G datasets when the D r CIs are estimated with the WP and AV methods. 
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Table 1 Characteristics of the whole-genome haplotype block partitions obtained with the WP and AV methods 
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significance level, with the exception of the last two 
loci. 

Discussion 

We propose an algorithm for haplotype block partition- 
ing, termed MIG ++ , which represents a scalable imple- 
mentation of the Haploview algorithm and produces 
the same results in a much shorter time and using a 
substantially smaller amount of main memory. MIG ++ 
can process large DNA regions using only a handful of 
megabytes of main memory. In such situations, Haploview 
would require gigabytes. In terms of runtime, the MIG ++ 
is several times faster than Haploview. We also demon- 
strated that more than 80% of calculations were not nec- 
essary for the purpose of block recognition and could be 
omitted, thus achieving a higher efficiency. The improved 
performance of the algorithm makes it possible to process 
very large chromosomal segments. When the approxi- 
mated variance estimator, proposed by Zapata et al. [26], 
is used to estimate the D' CI, the MIG ++ can be applied 
genome-wide and process high density datasets, such as 
the 1000G, in a very short time. 

With its very small memory requirements, the MIG ++ 
can process any number of SNPs. This allowed us to avoid 
Haploviews restrictions on the maximal haplotype block 
length (the default limitation is 500Kbp) and to consider 
the LD between SNPs at any distance. Our whole-genome 
experiments showed that the haplotype blocks, based on 
the Gabriel et al. [23] definition, can span more than 
500Kbp and can extend over several millions of base pairs. 
This empirical result suggests that limiting the maximal 
block length may alter the block partition. The alteration 
can be substantial because the algorithm prioritizes the 
largest blocks. The smallest blocks are retained only when 
they do not overlap with the largest ones. For this rea- 
son, to constrain the block length within pre-specified 
limits may induce a cascade of effects and may affect the 
final partition of very large segments. This is relevant, for 



example, when assessing the LD pattern of loci selected 
from GWAS, with the aim of identifying genes related 
to the lead SNP. In such cases, different partitions could 
imply different genes to be selected for follow-up. 

With the MIG ++ algorithm, we were able to run a hap- 
lotype block recognition of the entire HapMapll dataset. 
However, it still required an unacceptably long time to 
apply the algorithm to larger and denser genomes, such 
as the 1000G dataset. This limitation is due to the use of 
the Wall and Pritchard [27] method, which models the 
\D'\ likelihood and derives the \D'\ CIs using an iterative 
procedure. In contrast, if the D f variance is estimated 
with the approximated formula suggested by Zapata et 
al. [26], it is possible to derive the D' CI with a single 
mathematical calculation. Thanks to this computationally 
less demanding solution, we could perform a complete 
block recognition of the HapMapll dataset in 1 hour 
and to process the entire 1000G dataset in 44 hours. 
To the best of our knowledge, this is the first time that 
such a marker-dense genome has been partitioned with 
a threshold-free approach. Previously, block partition of 
the whole genome could only be achieved by dividing 
chromosomes into small chunks or by restricting com- 
putations using sliding window approaches. Such choices 
may introduce artificial breaks to the real haplotype 
structure. 

It is important to note that the block partition obtained 
with an algorithm based on the AV method is not fully 
equivalent to a partition obtained based on the WP 
method. For large sample sizes and for common variants, 
the estimated variance of the D' statistic is going to be sim- 
ilar, whichever method is used. However, when crossing a 
common with a rare SNP, it often happens that one of the 
four possible haplotypes is not present in the sample. In 
such situations, it is very likely that the D' CI shrinks to 1 
because the approximated variance is zero. In this way, the 
SNP pair is systematically classified as a strong LD pair. As 
a result, SNPs with rare alleles are easily grouped together 
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into very large blocks, boosting the region coverage and 
the median number of SNPs per block. The WP method 
is less sensitive to extreme D' values, and the resulting 
blocks are generally shorter. However, we observed that 
most (80%) of the haplotype blocks obtained with the WP 
method were contained within the larger blocks obtained 
with the AV method. That is, the use of the AV method 
produced a coarser partition, where AV blocks entirely 
contained one or more WP blocks. For this reason, the AV 
blocks showed a higher haplotype diversity, in the terms 
described by Patil etal [19] and Zhang etal. [20], than the 
WP blocks. 

To provide an application of a whole-genome haplo- 
type block partition, we analyzed the data from the North 
American Rheumatoid Arthritis Consortium (NARAC) 
dataset using both block partitions: the one obtained with 
the standard WP method and the one obtained with the 
AV approach. As observed in previous studies [32,33], the 
G WAS results were dominated by the HLA locus on chro- 
mosome 6. However, other loci were identified in other 
chromosomes. For what concerns the two block parti- 
tion methods, the results were very similar, suggesting 
that the AV approach might be a convenient way to run 
a fast recognition of the haplotype blocks. However, we 
recognize that ours was an empirical application based 
on half a million genotyped SNPs. Results might be dif- 
ferent in a larger context, such as that of a GWAS based 
on the 1000 Genomes dataset, where the number of AV 
blocks is expected to be much smaller than the num- 
ber of WP blocks, and the AV blocks are expected to be 
much larger than the WP ones. Our empirical analysis of 
the NARAC data also confirmed previous observations 
that SNP- and block-based analyses are complementary to 
each other [32,34]. In fact, in our analysis some loci were 
identified only by the single-SNP analysis, other loci were 
identified only by the haplotype-block analysis, and others 
by both methods. Thus, genome-wide haplotype associa- 
tion scans are not in competition with standard GWAS. 
Genome-wide haplotype association scans should be con- 
sidered as complementary tools that may help to identify 



loci that could be overlooked by methods based on single- 
SNP analysis. We also observed that haplotype blocks 
may simplify gene annotation. While only one gene, the 
HLA-DRA [35], which was reported by previous GWASs, 
was directly implied by a genome-wide significant SNP, 
four additional previously reported genes were implied by 
genome-wide significant blocks: the APOM, HLA-DQA1, 
HLA-DRB1, and HLA-DQA2 genes [35]. 

Conclusions 

We have provided an efficient and scalable haplotype 
block recognition algorithm, termed MIG ++ , which 
improves the well-known Haploview algorithm by reduc- 
ing memory complexity from quadratic to linear and 
by omitting approximately 80% of unnecessary compu- 
tations. The improved algorithm was able to efficiently 
process dense genomic segments of any size. When 
applied to individual-level data, where genotypes are 
available, the MIG ++ efficiency can be exploited to set 
up haplotype-based (genome-wide) association scans that 
could account for the correct underlying haplotype dis- 
tribution. This seems to be especially relevant when rarer 
variants are involved. If ran on summary results from 
GWAS, the MIG ++ could help identify biologically plau- 
sible scenarios for SNP-set analysis and it could support 
a more correct annotation of genes surrounding vari- 
ants of interest. From a population-genetic point of view, 
the method could facilitate the comparison of human 
genomes across different ethnicities, helping to highlight 
structural differences. Finally, the algorithm opens up the 
possibility to integrate genome-wide LD-based haplotype 
block structure into visual assessment tools, thus improv- 
ing the interpretation of already available, but incomplete, 
LD heatmaps (Figure 14). 

The MIG algorithms are available in the LDExplorer 
R package at www.eurac.edu/LDExplorer together with 
usage instructions and examples. Further improvements 
will include application of parallel computation tech- 
niques to MIG ++ in order to further speed up the pro- 
cessing while keeping memory requirements low. 
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Figure 1 4 Visualization of the LD-based haplotype block structure of a chromosome region using the UCSC genome browser [1 7]. Top 

panel: the pairwise LD values in chr20:31 ,767,872-33,700,401 calculated by the UCSC genome browser using Haploview for SNPs within 250 kb. 
Bottom panel: LD-based haplotype blocks obtained with MIG ++ using LD between SNPs at any distance. 
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